[internal-dns] register and publish ddmd in the switch zone #10381
zeeshanlakhani wants to merge 1 commit into main
Conversation
DDMD has always run in the switch zone alongside Dendrite, MGS, and MGD, but it was never registered in internal DNS, leaving no path for a cross-host consumer to discover it. This adds `ServiceName::Ddm`, plumbs `ddm_port` through the switch-zone DNS plumbing (RSS plan + reconfigurator DNS execution), threads an `Overridables::ddm_ports` map for the test suite, and lands a `DdmInstance` dropshot sim in test utils so that the test harness registers a real DDM port in DNS the same way it does for the other switch-zone services.

We also drop the duplicate `DDMD_PORT` const in `ddm-admin-client` in favor of the canonical `omicron_common::address::DDMD_PORT`. Same-host callers continue to use `Client::localhost()`.

This was extracted from the multicast PR (zl/multicast-mgd-ddm), where Nexus is the first consumer to resolve ddmd cross-host via DNS.
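For readers following along, here is a rough, self-contained sketch of the shape of that plumbing. Everything except `ServiceName::Ddm` and the four port parameters is a stand-in: the real builder, zone type, and method names live in omicron's internal-dns crate and may differ.

```rust
use std::net::Ipv6Addr;

// Hypothetical stand-ins so the sketch compiles on its own.
enum ServiceName { Dendrite, ManagementGatewayService, Mgd, Ddm }
struct Zone;
struct DnsConfigBuilder;

impl DnsConfigBuilder {
    fn host_zone(&mut self, _ip: Ipv6Addr) -> Zone { Zone }
    fn service_backend_zone(&mut self, _s: ServiceName, _z: &Zone, _port: u16) {}

    // The shape of the change this PR describes: the switch-zone entry
    // point gains a ddm_port and publishes it under ServiceName::Ddm.
    fn host_zone_switch(
        &mut self,
        switch_zone_ip: Ipv6Addr,
        dendrite_port: u16,
        mgs_port: u16,
        mgd_port: u16,
        ddm_port: u16, // new in this PR
    ) {
        let zone = self.host_zone(switch_zone_ip);
        self.service_backend_zone(ServiceName::Dendrite, &zone, dendrite_port);
        self.service_backend_zone(ServiceName::ManagementGatewayService, &zone, mgs_port);
        self.service_backend_zone(ServiceName::Mgd, &zone, mgd_port);
        // The new registration: cross-host consumers can now resolve ddmd.
        self.service_backend_zone(ServiceName::Ddm, &zone, ddm_port);
    }
}
```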
jgallagher
left a comment
Thank you very much for splitting this out!
```rust
    dendrite_port: u16,
    mgs_port: u16,
    mgd_port: u16,
    ddm_port: u16,
```
I think this function taking 3 u16s in a row was already sketchy - what do you think about changing this to take something like

```rust
struct HostSwitchZonePorts {
    dendrite: u16,
    mgs: u16,
    mgd: u16,
    ddm: u16,
}
```

instead of four separate arguments? That way we get named parameters at the call sites and don't have to rely on them getting the order correct, and inside this function we can destructure them:

```rust
let HostSwitchZonePorts { dendrite, /* ...all the rest... */ } = ports;
```

so we also get a compile-time check that we update this function if we change the struct?
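A fuller sketch of how that suggestion might land (everything here is illustrative: the function body, the port values, and `main` are not from the PR):

```rust
/// Ports for the services that run in a host's switch zone.
pub struct HostSwitchZonePorts {
    pub dendrite: u16,
    pub mgs: u16,
    pub mgd: u16,
    pub ddm: u16,
}

fn host_zone_switch(ports: HostSwitchZonePorts) {
    // Destructuring without `..` means adding a field to the struct is a
    // compile error here until this function is updated to handle it.
    let HostSwitchZonePorts { dendrite, mgs, mgd, ddm } = ports;
    // ...register each (service, port) pair in DNS; elided in this sketch...
    let _ = (dendrite, mgs, mgd, ddm);
}

fn main() {
    // Call sites name each port instead of passing four positional u16s.
    host_zone_switch(HostSwitchZonePorts {
        dendrite: 12224, // illustrative values only
        mgs: 12225,
        mgd: 4676,
        ddm: 8000,
    });
}
```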
```rust
/// In-process stand-in for the `ddmd` (Delay Driven Multipath daemon)
/// admin API.
///
/// `ddmd` runs in sled global zones and switch zones in real deployments,
```
Will the ddmd that runs in the global zones ever need to be in DNS too, or is that one only communicated with locally?
Currently no.
Sled global-zone ddmd is only ever accessed from its own host: DdmReconciler advertises that sled's own prefixes, and RSS / wicketd / installinator do bootstrap-time peer discovery via Client::localhost(). That's all per-host protocol semantics, not cross-host service discovery.
The new consumer for this work is the multicast RPW in Nexus, which talks to switch-zone ddmd cross-host to read DDM peer topology so it can map sled-id to switch port. That's the only cross-host case today, and the only reason the switch-zone ddmd needs DNS publication.
There's probably a future case where we'd want to centralize sled-ddmd state across the rack, but for another time.
I've updated the doc comment so the asymmetry between switch-zone (DNS-published) and sled-global-zone (local) access is made explicit.
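To make that asymmetry concrete, a hedged sketch of the two access patterns. The `Resolver` type, `lookup_socket_v6`, and both client constructors are assumptions about omicron's APIs; only `Client::localhost()` and `ServiceName::Ddm` appear in this PR.

```rust
// Pseudocode sketch; signatures are assumed, not confirmed by this PR.
async fn ddm_clients(
    resolver: &Resolver,
    log: &slog::Logger,
) -> anyhow::Result<()> {
    // Same-host path (a sled or switch zone talking to its own ddmd):
    // these callers continue to use Client::localhost(), per the PR text.
    let _local = ddm_admin_client::Client::localhost(log)?;

    // Cross-host path (e.g. the multicast RPW in Nexus): resolve the
    // switch-zone ddmd through internal DNS first.
    let addr = resolver.lookup_socket_v6(ServiceName::Ddm).await?;
    let _remote = ddm_admin_client::Client::new(&format!("http://{addr}"), log.clone());
    Ok(())
}
```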
```rust
/// internal DNS as `ServiceName::Ddm`.
///
/// This currently has no registered routes. Any integration needing
/// concrete endpoints (e.g., peer lists) must extend the `ApiDescription`.
```
This seems a little fishy - I think the other test-utils variants of these services run something approximating the "real" service (albeit a stub binary for non-illumos systems - it still handles all the normal API endpoints). If this doesn't serve the same API, how can tests use it?
Yeah, agreed.
Context: this came out of the multicast work. The multicast RPW in Nexus needs to resolve ServiceName::Ddm cross-host to read DDM peer topology, so we needed nexus-test-utils to publish a real DDM port in DNS. #10346 has a richer dropshot sim with /peers, set_peers, and a PeerMap for the multicast tests to sim against. I trimmed this down to bind-only because no test in this PR exercises any DDM endpoints.
Why not just subprocess the real ddmd?

ddmd has no "admin-only" mode like mgd has (via --no-bgp-dispatcher). It always brings up the discovery/exchange/routing daemons, which need real network interfaces and illumos-only kernel facilities. So we can't `tokio::process::Command::new("ddmd")` the way MgdInstance does for mgd, which is probably why this wasn't done before.
The proper fix is upstream: add an "admin-only"-like mode to ddmd in maghemite as a separate PR, bump the pin here, switch DdmInstance to spawn a real ddmd via subprocess (matching MgdInstance's pattern), and wire ddmd into the xtask download scheme. So, let me hook up a maghemite piece to this.
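In the meantime, a minimal sketch of what a bind-only dropshot sim like `DdmInstance` can look like, assuming dropshot's older `HttpServerStarter` API (newer releases use `ServerBuilder`); the function name here is hypothetical:

```rust
use dropshot::{ApiDescription, ConfigDropshot, HttpServer, HttpServerStarter};

/// Start an HTTP server with no registered routes: the sim only needs to
/// own a real TCP port so DNS-resolved lookups in tests have a live target.
async fn start_ddm_sim(log: &slog::Logger) -> anyhow::Result<(HttpServer<()>, u16)> {
    let api = ApiDescription::new();
    let server = HttpServerStarter::new(
        &ConfigDropshot {
            // Port 0: let the OS pick an ephemeral port.
            bind_address: "[::1]:0".parse().unwrap(),
            ..Default::default()
        },
        api,
        (),
        log,
    )
    .map_err(|e| anyhow::anyhow!("failed to start ddm sim: {e}"))?
    .start();
    // The test harness registers this port in internal DNS under
    // ServiceName::Ddm, mirroring the other switch-zone services.
    let port = server.local_addr().port();
    Ok((server, port))
}
```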